A single neuron has \(n\) inputs \(x_i\) and an output \(y\). To each input is associated a weight \(w_i\).
The activity rule is given by two steps:
\[a = \sum_{i} w_ix_i, \quad i=1,...,n\]
\[\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}\]
(MacKay, 2003)
\[a = w_0 + \sum_{i} w_ix_i, \quad i=1,...,m\]
\[y = y(a) = g\left( w_0 + \sum_{i=1}^{m} w_ix_i \right)\]
or in vector notation
\[y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)\]
where:
\[\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_m\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_m\end{bmatrix}\]
(Alexander Amini, 2021)
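As a concrete illustration, a minimal NumPy sketch of this forward pass (the sigmoid activation and the example values are assumptions, not taken from the source):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Example input and weight vectors (m = 3); values are arbitrary
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
w0 = 0.05                      # bias term

a = w0 + np.dot(w, x)          # activation: a = w0 + sum_i w_i x_i
y = sigmoid(a)                 # activity:   y = g(a)
print(a, y)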
Vectorized versions: input \(\boldsymbol{x}\), weights \(\boldsymbol{w}\), output \(\boldsymbol{y}\)
\[a = \boldsymbol{wx}\]
(Herzen et al., 2021)
(Shen et al., 2018)
(Karpathy, 2015)
Assume multiple time points.
Order matters: “dog bites man” vs. “man bites dog”
Folded representation
Unfolded representation
Add a hidden state \(h\) that introduces a dependency on the previous step:
\[ \hat{Y}_t = f(X_t, h_{t-1}) \]
RNNs have what one could call “sequential memory” (Phi, 2020)
Exercise: say the alphabet in your head
A B C ... X Y Z
Modification: start from, e.g., the letter F
It may take a moment to get started, but from there on it’s easy
Now recite the alphabet in reverse:
Z Y X ... C B A
Memory access is associative and context-dependent
Add recurrence relation where current hidden cell state \(h_t\) depends on input \(x_t\) and previous hidden state \(h_{t-1}\) via a function \(f_W\) that defines the network parameters (weights):
\[ h_t = f_\mathbf{W}(x_t, h_{t-1}) \]
Note that the same function and weights are used across all time steps!
# Pseudocode for a simple RNN cell (illustrative, not runnable as-is)
class RNN:
    def __init__(self):
        # Initialize weights and cell state
        self._h = [...]
        self._Whh = [...]
        self._Wxh = [...]
        self._Why = [...]

    def update_cell_state(self, x):
        # `function` is some nonlinearity (e.g. tanh) that updates the cell state
        self._h = function(self._h * self._Whh + x * self._Wxh)

    def predict(self):
        return self._h * self._Why

    def update_weights(self, y):
        # Calculate error via some loss function
        error = loss(self.predict(), y)
        # Update weights via backpropagation...

rnn = RNN()
for x, y in input_data:
    rnn.update_cell_state(x)
    rnn.update_weights(y)

# Retrieve next prediction
yhat = rnn.predict()

\[ \hat{Y}_t = \mathbf{W_{hy}^T}h_t \]
\[ h_t = \mathsf{tanh}(\mathbf{W_{xh}^T}X_t + \mathbf{W_{hh}^T}h_{t-1}) \]
Input: \(X_t\)
(Olah, 2015)
Note: \(\mathbf{W_{xh}}\), \(\mathbf{W_{hh}}\), and \(\mathbf{W_{hy}}\) are shared across all cells!
(Onnen, 2021)
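To make the recurrence concrete, a minimal NumPy sketch of the forward pass through such a cell (the dimensions and random initialization are assumptions for illustration only):

import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 3                      # input and hidden dimensions (assumed)
Wxh = rng.normal(size=(h, d))    # input-to-hidden weights
Whh = rng.normal(size=(h, h))    # hidden-to-hidden weights
Why = rng.normal(size=(1, h))    # hidden-to-output weights

def rnn_step(x_t, h_prev):
    # h_t = tanh(Wxh x_t + Whh h_{t-1}); the same weights are reused at every step
    return np.tanh(Wxh @ x_t + Whh @ h_prev)

h_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):   # a toy sequence of 5 time steps
    h_t = rnn_step(x_t, h_t)
    y_t = Why @ h_t                   # output: Y_hat_t = Why h_t
print(y_t)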
Partition the time series into training and test data sets at, e.g., a 2:1 ratio:
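A minimal sketch of such a split, assuming the airline passenger series is available as a 1-D NumPy array called series and is windowed into time_steps-long inputs (both names are assumptions):

import numpy as np

def make_windows(values, time_steps):
    # Build (samples, time_steps, 1) inputs and next-step targets
    X, y = [], []
    for i in range(len(values) - time_steps):
        X.append(values[i:i + time_steps])
        y.append(values[i + time_steps])
    return np.array(X)[..., np.newaxis], np.array(y)

time_steps = 12
split = int(len(series) * 2 / 3)            # 2:1 train/test split
X_train, y_train = make_windows(series[:split], time_steps)
X_test, y_test = make_windows(series[split:], time_steps)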
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN

model = Sequential()
model.add(SimpleRNN(units=3, input_shape=(time_steps, 1),
                    activation="tanh"))
model.add(Dense(units=1, activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()

## Model: "sequential"
## _________________________________________________________________
## Layer (type)                 Output Shape              Param #
## =================================================================
## simple_rnn (SimpleRNN)       (None, 3)                 15
## _________________________________________________________________
## dense (Dense)                (None, 1)                 4
## =================================================================
## Total params: 19
## Trainable params: 19
## Non-trainable params: 0
## _________________________________________________________________
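A sketch of how this model could then be trained and evaluated, assuming the windowed arrays X_train, y_train, X_test, y_test from the split above (the number of epochs and batch size are arbitrary):

model.fit(X_train, y_train, epochs=30, batch_size=16, verbose=0)
y_pred = model.predict(X_test)                 # predictions on the test windows
mse = model.evaluate(X_test, y_test, verbose=0)
print("Test MSE:", mse)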
See if you can improve the airline passenger model. Some things to try:
(Alexander Amini, 2021)
Errors are propagated backwards in time, from the final time step \(t\) back to \(t=0\).
Problem: calculating the gradient involves large powers of \(\mathbf{W_{hh}}^{\mathsf{T}}\) (e.g. \(\partial\mathcal{L} / \partial h_0 \sim f\big((\mathbf{W_{hh}}^{\mathsf{T}})^t\big)\))
In layer \(i\), the gradient size scales as \((\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}\) (a small numerical sketch follows the chain below)
\(\downarrow\)
Weight adjustments depend on size of gradient
\(\downarrow\)
Early layers tend to “see” small gradients and do very little updating
\(\downarrow\)
Parameters become biased toward learning from recent events
\(\downarrow\)
RNNs suffer from short-term memory
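A small NumPy illustration of why repeated powers of \(\mathbf{W_{hh}}\) cause this: the example matrices are made up, and what matters is whether the largest eigenvalue is below or above 1.

import numpy as np

W_small = 0.5 * np.eye(3)   # largest eigenvalue 0.5
W_large = 1.5 * np.eye(3)   # largest eigenvalue 1.5

for t in (1, 5, 20):
    print(t,
          np.linalg.norm(np.linalg.matrix_power(W_small, t)),   # shrinks -> vanishing gradient
          np.linalg.norm(np.linalg.matrix_power(W_large, t)))   # grows   -> exploding gradient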
(Olah, 2015)
“The clouds are in the _”
“I grew up in England … I speak fluent _”
Activation function
ReLU (or leaky ReLU) instead of sigmoid or tanh
Weight initialization
Initialize biases to 0 and weights to the identity matrix
More complex cells using “gating”
For example LSTM
Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)
Remember the important parts, pay less attention to (forget) the rest.
(Olah, 2015)
Information flows in the cell state from \(c_{t-1}\) to \(c_t\).
Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).
(Cho et al., 2014)
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
Purpose: reset content of cell
Purpose: decide when to read data into cell
Purpose: read entries from cell
Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\), where anything from no information (0) to all information (1) passes through the gate.
Purpose: decide what information to keep or throw away
Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\), where 0=forget, 1=keep.
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
Two steps to adding new information:
\[ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c) \]
\[ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \]
Output is filtered version of cell state.
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]
(Zhang et al., 2021)
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]
\[ x_t \in \mathbb{R}^{n\times d}, \quad h_{t-1} \in \mathbb{R}^{n \times h}, \quad i_t \in \mathbb{R}^{n\times h}, \quad f_t \in \mathbb{R}^{n\times h}, \quad o_t \in \mathbb{R}^{n\times h} \]
and
\[ W_f \in \mathbb{R}^{h \times (h+d)}, \quad W_i \in \mathbb{R}^{h \times (h+d)}, \quad W_o \in \mathbb{R}^{h \times (h+d)}, \quad W_c \in \mathbb{R}^{h \times (h+d)} \]
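A minimal NumPy sketch of a single LSTM step following these equations, using the concatenated \([h_{t-1}, x_t]\) convention (sizes and random initialization are assumptions):

import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 3                                  # input and hidden sizes (assumed)
Wf, Wi, Wc, Wo = (rng.normal(size=(h, h + d)) for _ in range(4))
bf = bi = bc = bo = np.zeros(h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)                 # forget gate
    i = sigmoid(Wi @ z + bi)                 # input gate
    c_tilde = np.tanh(Wc @ z + bc)           # candidate cell state
    c = f * c_prev + i * c_tilde             # new cell state
    o = sigmoid(Wo @ z + bo)                 # output gate
    return o * np.tanh(c), c                 # new hidden state, new cell state

h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h))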
Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.
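As a starting point, a sketch of the earlier model with the SimpleRNN layer swapped for an LSTM (the settings shown are just one choice to experiment with):

from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(units=3, input_shape=(time_steps, 1), activation="tanh"))
model.add(Dense(units=1, activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='adam')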
LSTM with Variable Length Input Sequences to One Character Output
Predict next character in sequence of strings
(Brownlee, 2017)
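A hedged sketch of this setup, assuming the alphabet as the vocabulary, zero-padded variable-length windows via pad_sequences, and a one-hot next-character output; details differ from Brownlee's original code.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Variable-length input sequences, each predicting the next character
max_len = 5
X, y = [], []
for end in range(1, len(alphabet)):
    start = max(0, end - np.random.randint(1, max_len + 1))
    X.append([char_to_int[c] for c in alphabet[start:end]])
    y.append(char_to_int[alphabet[end]])

X = pad_sequences(X, maxlen=max_len, dtype='float32') / float(len(alphabet))
X = X.reshape((X.shape[0], max_len, 1))
y = to_categorical(y, num_classes=len(alphabet))

model = Sequential()
model.add(LSTM(32, input_shape=(max_len, 1)))   # return_sequences=False by default
model.add(Dense(len(alphabet), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=200, verbose=0)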